dARTMAP: A Neural Network for Fast Distributed Supervised Learning
Authors: Carpenter, Milenova, & Noeske (Technical Report CAS/CNS TR-97-026)
Abstract
Distributed coding at the hidden layer of a multi-layer perceptron (MLP) endows the network with memory compression and noise tolerance capabilities. However, an MLP typically requires slow off-line learning to avoid catastrophic forgetting in an open input environment. An adaptive resonance theory (ART) model is designed to guarantee stable memories even with fast on-line learning. However, ART stability typically requires winner-take-all coding, which may cause category proliferation in a noisy input environment. Distributed ARTMAP (dARTMAP) seeks to combine the computational advantages of MLP and ART systems in a real-time neural network for supervised learning. An implementation algorithm here describes one class of dARTMAP networks. This system incorporates elements of the unsupervised dART model as well as new features, including a content-addressable memory (CAM) rule for improved contrast control at the coding field. A dARTMAP system reduces to fuzzy ARTMAP when coding is winner-take-all. Simulations show that dARTMAP retains fuzzy ARTMAP accuracy while significantly improving memory compression.

Keywords: Distributed ARTMAP, Adaptive Resonance, ART, ARTMAP, Distributed Coding, Fast Learning, Supervised Learning, Neural Network

1. DISTRIBUTED CODING BY ADAPTIVE RESONANCE SYSTEMS

Adaptive resonance theory (ART) began with an analysis of human cognitive information processing (Grossberg, 1976, 1980). Fundamental computational design goals have therefore always included memory stability with fast or slow learning in an open and evolving input environment. As a real-time model of dynamic processes, an ART network is characterized by a system of ordinary differential equations, which are approximated by an algorithm for implementation purposes. In a general ART system, an input is presumed to generate a characteristic pattern of activation, or spatial code, that may be distributed across many nodes in a field representing a brain region such as the inferior temporal cortex (e.g., Miller, Li, & Desimone, 1991).

While ART code representations may be distributed in theory, in practice nearly all ART networks feature winner-take-all (WTA) coding. These systems include ART 1 (Carpenter & Grossberg, 1987) and fuzzy ART (Carpenter, Grossberg, & Rosen, 1991), for unsupervised learning, and ARTMAP (Carpenter, Grossberg, & Reynolds, 1991) and fuzzy ARTMAP (Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992), for supervised learning. The coding field of a supervised system is analogous to the hidden layer of a multi-layer perceptron (MLP) (Rosenblatt, 1958, 1962; Rumelhart, Hinton, & Williams, 1986; Werbos, 1974), where distributed activation helps the network achieve memory compression and generalization. However, an MLP employs slow learning, which limits adaptation for each input and so requires multiple presentations of the training set. With fast learning, where dynamic variables are allowed to converge to asymptote on each input presentation, MLP memories suffer catastrophic forgetting. However, features of a fast-learn system, such as its ability to encode significant rare cases and to learn quickly in the field, may be essential for a given application domain. Additional ART capabilities, including stable coding and scaling to accommodate large databases, are also essential for many applications, such as the Boeing parts design retrieval system (Caudell, Smith, Escobedo, & Anderson, 1994).
An overall aim of the distributed ART (dART) research program is to combine the computational advantages of ART and MLP systems. Desirable properties include code stability when learning is fast and on-line, memory compression when inputs are noisy and unconstrained, and real-time system dynamics.

1.1 Distributed Learning

A key step in the derivation of the first family of dART models (Carpenter, 1996, 1997) was the specification of dynamic learning laws for stable distributed coding. These laws generalize the instar (Grossberg, 1972) and outstar (Grossberg, 1968, 1970) laws used, for example, in fuzzy ART. Instar and outstar learning features a gating operation that permits weight change only when a coding node is active. This property is critical to ART stability. With a distributed code and fast learning, however, instar and outstar dynamics cause catastrophic forgetting. A system such as Gaussian ARTMAP (Williamson, 1996) includes many features of a distributed coding network, but retains the instar and outstar learning laws of earlier ART and ARTMAP models. The weight update rules in a Gaussian ARTMAP algorithm therefore approximate a real-time system only in the slow-learn limit. Other ARTMAP variations, such as ART-EMAP (Carpenter & Ross, 1995) and ARTMAP-IC (Carpenter & Markuzon, 1998), acquire some of the advantages of distributed coding but sidestep the learning problem by permitting distributed activation during testing only.

The distributed instar (Carpenter, 1997) and distributed outstar (Carpenter, 1994) laws used in dART dynamically apportion learned changes according to the degree of activation of each coding node, with fast as well as slow learning. The update rules listed in the dARTMAP implementation algorithm represent exact, closed-form solutions of the model differential equations. These solutions are valid across all time scales, with fast or slow learning. When coding is WTA, the distributed learning laws reduce to instar and outstar equations, and dART reduces to fuzzy ART. Similarly, with coding that is WTA during training but distributed during testing, the dARTMAP algorithm specified here reduces to ARTMAP-IC, and further reduces to fuzzy ARTMAP with coding that is WTA during both testing and training.
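For reference, the gating property described above can be sketched in the winner-take-all form used by fuzzy ART; this is an illustrative assumption for exposition, not the distributed instar and outstar laws themselves, which are given by the implementation algorithm later in the report.

```python
import numpy as np

def instar_fast_learn(W, A, y):
    """Gated (instar-style) fast learning of the kind used in fuzzy ART.

    The underlying law is dw_j/dt = y_j * ((A AND w_j) - w_j), where AND is
    the componentwise minimum: the coding-node activation y_j gates the
    update, so a node with y_j = 0 never changes. With fast learning, every
    node whose gate is open converges to its asymptote A AND w_j.

    W : (N, 2M) weight matrix, one row per coding node
    A : (2M,) complement-coded input
    y : (N,) coding-field activation (one-hot under winner-take-all coding)
    """
    W_new = W.copy()
    for j in range(W.shape[0]):
        if y[j] > 0.0:                      # gate open: node j is active
            W_new[j] = np.minimum(A, W[j])  # fast-learn asymptote
    return W_new
```

With winner-take-all coding only the chosen node has an open gate, which is what keeps the other stored memories stable. With a distributed code, fast learning would drive every active node toward the same asymptote on a single trial, the catastrophic forgetting problem noted above, and this is what the distributed instar law is designed to avoid.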
1.2 dARTMAP Design Choices

An ART module is embedded as the primary component of ARTMAP, and similarly an unsupervised dART module is embedded in a supervised dARTMAP network. In applications, ARTMAP requires few design choices: the number of coding nodes is determined by on-line performance, and the default network parameters work well in most settings. In contrast, a general dARTMAP system presents the user with a far greater array of choices, due to the new degrees of freedom afforded by distributed code possibilities. In practice, a number of the "obvious" design choices have failed to produce good performance in simulation studies. The present article presents one family of dARTMAP networks that have performed well in pilot studies. In particular, dARTMAP retains fuzzy ARTMAP test set accuracy while significantly reducing network size. A self-contained dARTMAP algorithm is designed both to expedite ready implementation and to foster the development of alternative designs adapted to the demands of new applications.

1.3 Outline

A number of computational devices that were not part of the more general distributed ART theory were found to be useful in dARTMAP simulations. These include a new rule characterizing the content-addressable memory stored at the coding field in response to a given input (Section 2.1), an internal control device that causes the system to alternate between distributed and winner-take-all coding modes (Section 2.2), and credit assignment and instance counting (Section 2.3).

A geometric representation aids the visualization of distributed ARTMAP computational dynamics. Since the algorithm reduces to fuzzy ARTMAP when coding is winner-take-all, the geometric characterization of dARTMAP builds upon the geometry of fuzzy ARTMAP, which represents weight vectors as category boxes in input space (Section 3.1). The relationship between these boxes and a system input determines the order in which categories are searched (Section 3.2), and box expansion represents weight changes during winner-take-all learning (Section 3.3). Distributed ARTMAP replaces the long-term memory weights of fuzzy ARTMAP with dynamic weights, which depend on short-term memory coding node activations as well as long-term memory (Section 4.1). The corresponding geometric representation replaces each fuzzy ARTMAP category box with a nested family of boxes, one for each coding node activation value (Section 4.2). Some or all of these coding boxes may expand during dARTMAP learning, but the geometry shows how the system preserves dynamic range with fast as well as slow learning (Section 4.3). The rule in the dARTMAP algorithm that characterizes the signal transmitted to the coding field in response to a given input admits a geometric interpretation (Section 4.4), as does the rule characterizing the response of the content-addressable memory to the incoming signal (Section 4.5).

The dARTMAP algorithm includes the computational elements that were useful in simulation studies. For clarity, the training (Section 5.1) and testing (Section 5.2) portions of the algorithm are listed separately. In the version presented here, the dARTMAP algorithm is feedforward during testing.

A series of simulations indicate how the dARTMAP algorithm works. Distributed prediction in the basic algorithm reduces network size, but this system uses only binary connections from the coding field to the output field (Section 6.1). Performance can be improved by augmenting the trained dARTMAP system with a linear output map such as Adaline (Section 6.2). Other simulations analyze the role of dARTMAP learning that takes place in the distributed mode, as opposed to the winner-take-all mode (Section 6.3). By varying the degree of pattern contrast in the content-addressable memory system, dARTMAP performance can be improved without increasing network size (Section 6.4). A statistical analysis confirms the significance of simulation findings (Section 6.5). Finally, a step-by-step presentation of the geometry of dARTMAP learning demonstrates the detailed mechanisms of system dynamics (Section 7). Section 8 concludes with a discussion of possible dARTMAP variations and directions for future research.

2. CAM RULES, CODING MODES, AND CREDIT ASSIGNMENT

The unsupervised distributed ART network (Carpenter, 1996, 1997) features a number of innovations that differentiate it from previous ART networks, including a new architecture configuration and distributed instar and outstar learning laws (Figure 1).
In order to stabilize fast learning with distributed codes, dART represents the unit of long-term memory (LTM) as a subtractive threshold rather than a traditional multiplicative weight. Despite their different architectures, a dART algorithm reduces to fuzzy ART when coding is winner-take-all. While a dART module is the basic component of a supervised dARTMAP system, the algorithm specified in Section 5 also employs additional devices not included in the previous distributed ART description. These features, including a new rule defining coding field activation, alternation between WTA and distributed coding modes, and credit assignment, will now be described.

Figure 1: Distributed ART network

2.1 Increased Gradient CAM Rule

A neural network field of strongly competitive nodes can, once activated by an initial input, maintain a short-term memory (STM) activation pattern even after the input is removed. A new input then requires some active reset process before it can instate a different code, or content-addressable memory (CAM). A CAM rule specifies a function that characterizes the steady-state STM response to a given vector of inputs converging upon a field of neurons. Traditional CAM rules include McCulloch-Pitts activation, which makes STM proportional to input (McCulloch & Pitts, 1943); a power rule, which makes STM proportional to input raised to a power p; and a WTA rule, which concentrates all activation at the node receiving the largest net input. Other CAM rules include Gaussian activation functions, as used, for example, in radial basis function networks (Moody & Darken, 1989). A power rule reduces to a McCulloch-Pitts rule when p = 1 and converges to a WTA rule as p → ∞. Moving p from 0 toward infinity produces a stored STM pattern that is a progressively contrast-enhanced transformation of the input vector. In many examples, however, a power rule is problematic because differences among input components are small. A CAM system may then require unreasonably large powers p to produce significant differences among STM activations.

The CAM rule used in the dARTMAP algorithm is designed to enhance input differences as represented in the distributed internal code without raising input components to high powers. It is therefore called the increased gradient CAM rule. Beyond its role in the present system, this rule is useful for defining the steady-state activation function in other neural networks. The increased gradient rule includes a power p for contrast control. The role of p is analogous to the role of variance in Gaussian activation functions (Hertz, Krogh, & Palmer, 1991; Moody & Darken, 1989). A geometric representation of dARTMAP provides a natural interpretation of the increased gradient CAM rule (Section 4.5).
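The increased gradient rule itself is specified later in the report; as a point of reference, the classical power CAM rule described above can be sketched as follows. The normalization across nodes and the function name are illustrative assumptions.

```python
import numpy as np

def power_cam(T, p=1.0):
    """Classical power CAM rule: steady-state STM proportional to input^p.

    T : array of net input signals T_j converging on the coding field
    p : contrast-control power; p = 1 gives McCulloch-Pitts activation
        (STM proportional to input), and large p approaches winner-take-all.
    """
    Tp = np.power(T, p)
    return Tp / Tp.sum()          # normalized STM pattern

T = np.array([0.50, 0.52, 0.55])  # nearly equal inputs: small differences
print(power_cam(T, p=1))          # nearly uniform STM pattern
print(power_cam(T, p=50))         # strongly contrast-enhanced, close to WTA
```

With nearly equal inputs, as in this example, very large powers are needed before the stored pattern differs appreciably from uniform, which is the limitation that motivates the increased gradient rule.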
2.2 Distributed and Winner-take-all Coding Modes

The CAM rule solves a pattern separation problem that often arises in neural systems, where each element has a limited dynamic range. A second common problem is how to choose the size of a neural network. In a multi-layer perceptron, for example, deciding on the number of hidden units is a critical design choice. With WTA coding, ARTMAP determines network size by adding category nodes incrementally, to meet the demands of on-line predictive accuracy. Some types of MLP networks have also been designed to add hidden units incrementally. A cascade correlation architecture, for example, creates a hierarchy of single-unit hidden layers until the error criterion is met (Fahlman & Lebiere, 1990), but weights in all lower layers are frozen during learning associated with the top layer.

With distributed coding, a dARTMAP network could, in principle, operate with a field of coding nodes that are fixed a priori. In practice, this type of network did not produce satisfactory results in simulation studies, where fast learning tended to make the learned representations too uniform. To solve this problem, the dARTMAP algorithm alternates between distributed and winner-take-all coding modes, as follows. Each dARTMAP input first activates a distributed code. If this code produces a correct prediction, learning proceeds in the distributed coding mode. If the prediction is incorrect, the network resets the active code via ARTMAP match tracking feedback (Carpenter, Grossberg, & Reynolds, 1991). In ARTMAP networks, the reset process triggers a search for a category node that can successfully code the current input. In dARTMAP, reset also places the system in a WTA coding mode for the duration of the search. The switch from a distributed mode to a WTA mode could be implemented in a competitive network by means of a nonspecific signal that increases the strength of intrafield inhibition (Ellias & Grossberg, 1975; Grossberg, 1973). Such an arousal signal might be interpreted as an increase in overall attentiveness in response to an error signal or alarm, the computational result being a sharpened focus on the most salient input features.

In WTA mode, dARTMAP can, like ARTMAP, add nodes incrementally as needed. When a coding node is added to the network, it becomes permanently associated with the output class that is active at the time. From then on, the network predicts this class whenever the same coding node is chosen in WTA mode. In distributed mode, STM activations across all nodes that project to a given output class provide evidence in favor of that outcome. Despite its computational advantages, the winner-take-all possibility implies that dARTMAP coding is not fully distributed all the time, indicating one possible direction for future system modifications.
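A hedged sketch of this alternation for a single training input follows; the helper methods (distributed_code, predict, distributed_learn, match_track_reset, wta_search, add_node, wta_learn) are illustrative assumptions, and the actual training procedure is specified in Section 5.1.

```python
def train_one_input(net, A, target_class):
    """Sketch of dARTMAP's alternation between coding modes for one input.

    A            : complement-coded training input
    target_class : correct output class for this input
    """
    y = net.distributed_code(A)            # each input first activates a distributed code
    if net.predict(y) == target_class:
        net.distributed_learn(A, y)        # correct prediction: learn in distributed mode
        return

    # Incorrect prediction: match tracking resets the active code and the
    # system enters winner-take-all mode for the duration of the search.
    net.match_track_reset()
    J = net.wta_search(A, target_class)    # search for a node predicting the correct class
    if J is None:
        J = net.add_node(target_class)     # new node, permanently tied to the active class
    net.wta_learn(A, J)                    # learn in winner-take-all mode
```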
2.3 Credit Assignment, Instance Counting, and Match Tracking

When a dARTMAP network makes a distributed prediction, some of the active coding nodes may be linked to an incorrect outcome. In a real-time network, a feedback loop for credit assignment would suppress activation in these nodes during training (Figure 2). Credit assignment allows learning to enhance only those portions of an active code that are associated with the correct outcome. This procedure is similar to credit assignment algorithms widely used in other neural networks (e.g., Williamson, 1996) and genetic algorithms (e.g., Booker, Goldberg, & Holland, 1989).

Figure 2: Distributed ARTMAP network

The current simulations were also found to benefit from design features used in the ARTMAP-IC network. These include instance counting of category exemplars and the MT- match tracking search rule. Instance counting biases output predictions according to previous coding node activations summed over training set inputs. The MT- search rule generally improves memory compression compared to the original ARTMAP match tracking algorithm (MT+). It also permits a system to encode inconsistent cases, where two identical training set inputs are associated with different outcomes. Inconsistent cases are common in medical databases, for example.

Aspects of the dARTMAP algorithm such as the increased gradient CAM rule, the combination of WTA with distributed coding during training, credit assignment, and instance counting are not necessarily fundamental principles intrinsic to the class of all dARTMAP networks. Rather, they are developed for the pragmatic purpose of defining one set of dARTMAP systems with the desired computational properties.

A real-time neural network can implement the computations of the dARTMAP algorithm (Figure 2). Because the algorithm considers only the case where the output vector represents discrete classes, it does not require all the variables shown in the network diagram. A geometric representation of dARTMAP dynamics (Section 4) helps visualize and motivate computations of the algorithm (Section 5). Because dARTMAP reduces to fuzzy ARTMAP when coding is WTA, the geometry of dARTMAP generalizes the geometry of fuzzy ARTMAP, which will first be reviewed (Section 3). Where possible, dARTMAP retains fuzzy ARTMAP notation as well.

3. FUZZY ARTMAP GEOMETRY

Both fuzzy ARTMAP and distributed ARTMAP employ an input preprocessing device called complement coding. Complement coding creates a system input vector $A$ equal to the concatenation of the original $M$-dimensional input $a$, where $0 \le a_i \le 1$, and its complement $a^c$, where $(a^c)_i = (1 - a_i)$. The input $A$ thus positively represents both "present features" ($a$) and "absent features" ($a^c$). In addition, using the city-block norm defined by $|v| = \sum_i |v_i|$, complement coding serves to normalize inputs, since then $|A| = \sum_{i=1}^{2M} A_i = \sum_{i=1}^{M} (a_i + (1 - a_i)) = M$. Complement coding allows weight vectors to be represented geometrically as boxes in the $M$-dimensional space of the vector $a$. The doubled system input vectors $A$ produce the endpoints of a set of intervals that define the edges of each box, as described in the following section.
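A minimal sketch of complement coding as just described; the function name is an illustrative assumption.

```python
import numpy as np

def complement_code(a):
    """Complement coding: A = (a, 1 - a) for an M-dimensional input a in [0, 1]^M.

    The 2M-dimensional result represents both present features (a) and
    absent features (1 - a), and its city-block norm is always |A| = M,
    so complement coding also normalizes the input.
    """
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])

a = np.array([0.2, 0.9, 0.5])
A = complement_code(a)
print(A)          # [0.2 0.9 0.5 0.8 0.1 0.5]
print(A.sum())    # 3.0, equal to M
```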
3.1 ARTMAP Category Boxes

During fuzzy ARTMAP learning, 2M-dimensional complement-coded inputs $A$ give rise to 2M-dimensional weight vectors $w_j = (w_{1j} \ldots w_{ij} \ldots w_{2M,j})$, one for each $F_2$ category node $j$ (Figure 3). Bottom-up weights equal top-down weights, so $w_{ij}$ may stand for both. For $i = 1 \ldots M$, weight $w_{ij}$ intuitively represents the degree to which the $i$th feature is consistently present in the inputs $a$ coded by the $j$th category, and $w_{i+M,j}$ represents the degree to which the $i$th feature is consistently absent. When both $w_{ij}$ and $w_{i+M,j}$ become small, the network treats the size of $a_i$ as unpredictive with respect to the $j$th category.

Figure 3: Simplified fuzzy ARTMAP network

The weight vector $w_j$ is depicted geometrically as an $M$-dimensional category box $R_j$ with edges defined by the intervals $[w_{ij}, 1 - w_{i+M,j}]$. The box $R_j$ is the set of points $q$ for which $w_{ij} \le q_i \le (1 - w_{i+M,j})$ (Figure 4a). The size $|R_j|$ is defined as the sum of the lengths of the box's $M$ defining intervals. Thus $|R_j| = \sum_{i=1}^{M} \bigl((1 - w_{i+M,j}) - w_{ij}\bigr) = M - |w_j|$.

Figure 4: Fuzzy ARTMAP geometry

When a node is first activated, or committed, the active node ($j = J$) becomes permanently associated with the active output class ($k = K = \kappa(J)$). The network adds a committed node when it determines that previously active nodes cannot adequately represent the current input. The number of committed nodes ($C$) grows incrementally during training. When a node $j$ is uncommitted, $w_{ij} = 1$. Then, when the node first becomes committed, $R_j$ equals the point box $\{a\}$, where $w_{ij} = 1 - w_{i+M,j} = a_i$ ($i = 1 \ldots M$) (Figure 4b).

3.2 ARTMAP Order of Search

When a committed $F_2$ node becomes active and incorrectly predicts the output class, fuzzy ARTMAP triggers a search process called match tracking. Match tracking increases the vigilance matching parameter $\rho$ just enough to reset the active category. Search ends when the chosen node predicts the correct output class $k = K$, provided that the vigilance matching criterion is also satisfied.

Fuzzy ARTMAP geometry serves to illustrate the order in which nodes are searched. Let $T_j$ denote the signal sent to the $j$th node of the category field $F_2$. The function that determines $T_j$ depends jointly on the current input $a$ and on the learned weight vector $w_j$. With WTA coding, $F_2$ nodes become active in order of the size of $T_j$, starting with the largest. The geometric version of a choice-by-difference signal function (Carpenter & Gjaja, 1994) sets:
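The signal function itself falls outside this excerpt; as a hedged companion to the category-box geometry of Section 3.1, the box and its size can be recovered from a weight vector as follows. The function names are illustrative assumptions rather than part of the fuzzy ARTMAP specification.

```python
import numpy as np

def category_box(w, M):
    """Recover the category box R_j from a 2M-dimensional weight vector w_j.

    Edge i of the box is the interval [w_i, 1 - w_{i+M}], so the box is the
    set of points q with w_i <= q_i <= 1 - w_{i+M}.
    Returns (lower corner, upper corner, box size |R_j| = M - |w_j|).
    """
    w = np.asarray(w, dtype=float)
    lower = w[:M]
    upper = 1.0 - w[M:]
    size = float(np.sum(upper - lower))   # equals M - |w_j|
    return lower, upper, size

def contains(lower, upper, q):
    """True if point q lies inside the box with the given corners."""
    q = np.asarray(q, dtype=float)
    return bool(np.all(lower <= q) and np.all(q <= upper))

# A newly committed node stores w_j = A = (a, 1 - a): a point box at a.
a = np.array([0.2, 0.9, 0.5])
w = np.concatenate([a, 1.0 - a])
lower, upper, size = category_box(w, M=3)
print(lower, upper, size)   # [0.2 0.9 0.5] [0.2 0.9 0.5] 0.0
```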